Parallelization of MIRA Whole Genome and EST Sequence Assembler
Abstract
The genome assembly problem is to reconstruct the original DNA sequence of an organism from a large set of short, overlapping fragments. MIRA is an open-source assembler based on the Overlap-Layout-Consensus (OLC) graph model that addresses this problem and is widely used by biologists [1,2]. Like other assemblers, MIRA takes a long time to assemble a large number of sequences. For example, it takes around 24 hours to assemble a dataset of 1.4 million DNA sequence fragments, and even longer for EST assemblies [3]. In this paper, we report our efforts to parallelize the MIRA assembler. The task is challenging because MIRA contains critical code segments that are inherently sequential but crucial for a good-quality assembly. Initially, we focus on the two time-consuming kernels of the MIRA assembler: graph construction and edge weight calculation. Our parallel implementation, run on a machine with two Intel Xeon X5550 (Nehalem-EP) quad-core processors, achieves linear speedup for these kernels.
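To make the two kernels concrete, the following is a minimal sketch of why overlap-graph construction and edge weighting lend themselves to parallel execution: each pair of reads can be scored independently. It is not MIRA's code. The function names, the exact suffix-prefix matching, and the use of overlap length as the edge weight are all assumptions chosen for illustration; MIRA's actual alignment and weighting are more involved.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` exists.
    Hypothetical helper: exact matching stands in for real alignment.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def score_pair(pair):
    """Score one unordered pair of reads in both directions."""
    (i, a), (j, b) = pair
    return [
        (i, j, suffix_prefix_overlap(a, b)),
        (j, i, suffix_prefix_overlap(b, a)),
    ]


def build_overlap_graph(reads, workers: int = 4):
    """Build a directed overlap graph as {(src, dst): weight}.

    Pairs are independent, so they are mapped over a worker pool.
    Assumption: the edge weight is simply the overlap length.
    """
    pairs = combinations(enumerate(reads), 2)
    edges = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for results in pool.map(score_pair, pairs):
            for src, dst, weight in results:
                if weight > 0:
                    edges[(src, dst)] = weight
    return edges


reads = ["ACGTAC", "TACGGA", "GGATTC"]
graph = build_overlap_graph(reads)
print(graph)
```

A thread pool keeps the sketch short and portable, but in CPython it will not speed up CPU-bound scoring; a process pool or native threads would be needed for that. The all-pairs comparison is also quadratic in the number of reads, which is why production assemblers filter candidate pairs before aligning them.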